Golang Job: Site Reliability Engineer - Data Platforms (strong

Company

Artech, LLC
United States of America

Location

Remote Position
(From Everywhere/No Office Location)

Job type

Full-Time

Golang Job Details

Description
Data Platform Site Reliability Engineering manages infrastructure and applications on bare-metal and cloud computing platforms to deliver data processing, governance, and storage for many of our global products and organizations. Our platform teams work with exabytes of data, terabytes of memory, and hundreds of thousands of jobs running millions of executions to support predictable and performant data analytics that enable features in our world-class products. Ensuring that all of these technologies, spread across geographically distributed data centers and platforms, work together in harmony presents unique challenges. As an SRE, you'll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.
Data Platform Services SREs work directly with our partner engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers. We run a combination of open-source, vendor-licensed, and internally developed tools, which you will use and have opportunities to improve. The cross-functional team collaborates to ensure we apply a consistent incident management process across all data platform services and provide user-journey-based SLOs derived from exhaustive observability metrics, high-availability architecture, and automation for deployments. We think critically and strive to balance the best solution with the need to get things done for each engineering challenge we face.
Key Qualifications

  • Strong sense of ownership and integrity demonstrated through clear communication and collaboration
  • Experience operating production applications at scale, including detailed performance testing, HA and disaster recovery concepts, capacity planning, and managing distributed systems on internal and public cloud infrastructure, principally Kubernetes
  • Proficiency in authoring and releasing code in Go, Python, Java, or Scala using common configuration management and software delivery platforms
  • Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source Big Data technologies, especially Apache Spark, Flink, Airflow, Hive, Hadoop/HDFS, Trino, Druid, or related software
  • Experience with storage and coordination systems such as Apache Cassandra, ZooKeeper, etcd, and Redis, as well as blob and block storage technologies
  • Frustration with toil and an acute drive to automate manual operations and evolve them into self-running processes
  • Understanding of the Linux Operating System, containers and virtualization, standard networking protocols, and components
  • Demonstrates excellent troubleshooting and problem-solving skills using the scientific method
  • Ability to participate in our 24x7 weekly on-call rotation

Skills

  • Kubernetes, Amazon EKS, and/or GKE
  • Python, Golang, Scala, and/or Java comprehension and development experience
  • Manual process automation through innovative tools
  • High-Availability Architecture
  • Big Data Processing and/or Data Governance: Apache Spark, Hive, Hadoop/HDFS, Trino, Druid, etc.
  • Fault troubleshooting of virtual and/or containerized services through the stack
  • Developing and managing CI/CD or software delivery pipelines
  • Infrastructure-as-Code orchestration tools, such as Terraform or Pulumi
  • Agile methodology

Education or Experience

  • BS/MS in Computer Science or Equivalent (5+ years of software development or production operations experience in a large-scale environment)